The GRID and the Linux Farm at the RCF

The emergence of the GRID architecture and related tools will have a large impact on the operation
and design of present and future large clusters. We present here the ongoing efforts to equip the
Linux Farm at the RHIC Computing Facility with Grid-like capabilities.



I. BACKGROUND 

The RHIC Computing Facility (RCF) is a large-scale
data processing facility at Brookhaven National
Laboratory (BNL) for the Relativistic Heavy Ion Collider
(RHIC), a collider dedicated to high-energy nuclear
physics experiments.

RHIC's first physics collisions occurred in the summer
of 2000, when all four experiments began recording
data from the collisions. Year 3 of RHIC operations
is currently underway.

The RCF provides for the computational needs 
of the five RHIC experiments (BRAHMS, PHENIX, 
PHOBOS, PP2PP and STAR), including batch, mail, 
printing and data storage. In addition, BNL is the 
U.S. Tier 1 Center for ATLAS computing, and the 
RCF also provides for the computational needs of the 
U.S. collaborators in ATLAS. 

The Linux Farm at the RCF provides the majority
of the CPU power in the RCF. It is currently listed as
the third-largest cluster, according to "Clusters Top500"
(http://clusters.top500.org). Figure 1 shows the rapid
growth of the Linux Farm in the last few years.

All aspects of its development (hardware and software),
operations and maintenance are overseen by
the Linux Farm group, currently a staff of 5 FTE
within the RCF.



II. HARDWARE 



The Linux Farm is built with commercially available,
thin, rack-mounted, Intel-based servers (1-U and
2-U form factors). Currently, there are 1097 dual-CPU
production servers with a total of approximately 917,728
SPECint2000. Table I summarizes the hardware currently
in service in the Linux Farm. Hardware reliability
has not been an issue at the RCF. The average failure
rate is 0.0052 failures / (machine · month), which
translates to 5.7 hardware failures per month at the
present size of the farm. Hardware failures are dominated
by disk and power supply failures. A detailed breakdown of
the hardware failures by category is shown in Figure 2.
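The monthly failure estimate quoted above follows directly from the failure rate and the number of servers. The short Python sketch below reproduces that arithmetic, taking the per-model server counts from Table I; the variable and function names are ours and are chosen only for illustration.

```python
# Server counts per row of Table I
# (VA Linux 450/700/800 MHz, then IBM 1.0/1.4/2.4 GHz).
SERVERS_PER_MODEL = [154, 48, 168, 315, 160, 252]

# Average failure rate quoted in the text, in failures per machine per month.
FAILURE_RATE = 0.0052


def expected_failures_per_month(n_servers: int, rate: float = FAILURE_RATE) -> float:
    """Expected number of hardware failures per month for n_servers machines."""
    return n_servers * rate


total_servers = sum(SERVERS_PER_MODEL)
monthly_failures = expected_failures_per_month(total_servers)

print(total_servers, round(monthly_failures, 1))  # prints: 1097 5.7
```

The same function gives a rough estimate of the failure load to expect as the farm grows, under the assumption that the per-machine rate stays constant.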



TABLE I: Linux Farm hardware

Brand     CPU      RAM       Storage    Quantity
VA Linux  450 MHz  0.5-1 GB  9-120 GB   154
VA Linux  700 MHz  0.5 GB    9-36 GB    48
VA Linux  800 MHz  0.5-1 GB  18-480 GB  168
IBM       1.0 GHz  0.5-1 GB  18-144 GB  315
IBM       1.4 GHz  1 GB      36-144 GB  160
IBM       2.4 GHz  1 GB      240 GB     252



III. SOFTWARE 

The Linux Farm at the RCF uses a custom image of
RedHat 7.2, modified to conform to the requirements
of the RHIC experiments and to the security protocols
of BNL. The customized image is installed via KickStart [1],
the RedHat Linux automated installation tool.

The Linux Farm servers are equipped with a variety
of compilers (gcc, PGI, Intel) and debuggers (gdb,
Totalview, Intel) to provide a large degree of flexibility
to its end users. In addition, the servers
support network file systems (AFS, NFS) and two batch
services: LSF and RCF-designed software compatible
with our MDS system. Figure 3 shows the GUI
for the RCF-designed batch software.

Monitoring and control of the cluster hardware,
software and infrastructure (power and cooling) are
provided via a mix of open-source software, RCF-designed
software and vendor-provided software. Figures 4 and 5
display some of our cluster monitoring
tools, while Figure 6 shows the historical temperature
data of the cluster equipment, one of the tools we use
to monitor the infrastructure supporting the Linux
Farm.



IV. SECURITY 

Because of general cyber-security standards at BNL
and as part of the general security policy of the RCF,
several measures were taken to ensure that only authorized
users can access the Linux Farm.

A firewall has been placed around the RCF (including
the Linux Farm) to minimize security breaches.
In addition, users are only allowed interactive access
to the Linux Farm via dedicated gatekeeper systems
whose software is kept up-to-date to minimize the exploitation
of security weaknesses in the software. No
other method is provided for interactive access to the
Linux Farm. The operating system in the Linux Farm
servers has been modified to accommodate these security
measures and to enhance our ability to detect
unauthorized access.

FIG. 1: The growth of the Linux Farm at the RCF.

The RCF has also begun the deployment of a Kerberos 5
single sign-on system that will eventually replace
our current authentication system.



V. GRID-LIKE CAPABILITIES 

GRID-like technology has evolved from conceptual
designs to promising prototypes with real capabilities
in the last few years. It is a natural fit to increasingly
powerful Linux clusters coupled with geographically
diverse end-users and increasingly large data samples
typical of large-scale high-energy and nuclear physics
experiments. GRID-like technology is also making significant
inroads into industrial applications and has
attracted the interest and support of well-known software
manufacturers, such as Platform Computing [2].

The Linux Farm has begun to investigate, install
and support (where appropriate) prototypes of GRID-like
software that possess capabilities that are of interest
to our users, such as Ganglia, Condor and
GLOBUS.



A. GANGLIA 

Ganglia [3] is a full-featured, open-source distributed
monitoring software for high-performance computer
clusters. It is based on a hierarchical design targeted
at federations of clusters, and it supports clusters of up
to 2000 servers in size. Early prototypes are already
equipped with an end-user Web interface, historical
data and clustering of remote systems.
The monitoring data collected by Ganglia can be used
as the basis of a batch system job scheduler mechanism,
although this has not yet been tested.

The Linux Farm has deployed a Ganglia prototype
for the STAR experiment. In the prototype, the Ganglia
collector has been configured to gather information
from each of the nodes where a Ganglia client daemon
is running. This master collector then makes this
information available to qualified external collectors.
Figure 9 illustrates the Ganglia deployment within the
RCF.

Security issues with this Ganglia prototype are being
investigated. Of particular concern is that there is
currently no user-friendly method to restrict the type and
amount of information transmitted to external collectors.
Wrapper scripts written by RCF staff were
used to restrict the information (see Figures 7 and 8).
In addition, as more servers are added to the Ganglia
master collector, scalability will become a major
concern as well. The Linux Farm group plans to
continue to test and expand the scope of the Ganglia
prototype where appropriate.

FIG. 2: Breakdown of hardware failure by category.

FIG. 3: GUI for RCF-designed batch software.

FIG. 4: Alert status of Linux Farm nodes.
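To illustrate the kind of restriction such a wrapper script can apply, the Python sketch below removes every metric that is not on an allowlist from a monitoring report before it is passed to an external collector. It is not the RCF script. The report layout (nested CLUSTER, HOST and METRIC elements with NAME and VAL attributes) is assumed here, and the allowlist contents, function name and sample data are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical allowlist: metric names considered safe to publish externally.
ALLOWED_METRICS = {"load_one", "cpu_num"}


def filter_metrics(xml_text: str, allowed=ALLOWED_METRICS) -> str:
    """Return the XML report with every METRIC not in `allowed` removed."""
    root = ET.fromstring(xml_text)
    # Materialize both lists first, since elements are removed while looping.
    for host in list(root.iter("HOST")):
        for metric in list(host.findall("METRIC")):
            if metric.get("NAME") not in allowed:
                host.remove(metric)
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample report with one permitted and one restricted metric.
SAMPLE = """<GANGLIA_XML VERSION="2.5" SOURCE="gmond">
  <CLUSTER NAME="example">
    <HOST NAME="node001" IP="10.0.0.1">
      <METRIC NAME="load_one" VAL="0.42"/>
      <METRIC NAME="os_release" VAL="2.4.18"/>
    </HOST>
  </CLUSTER>
</GANGLIA_XML>"""

filtered = filter_metrics(SAMPLE)
print(filtered)
```

An allowlist is used rather than a blocklist so that a metric added to the cluster later is withheld from external collectors until it has been explicitly approved.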



B. CONDOR 

Condor is an open-source batch software created and
supported by the University of Wisconsin [4]. Condor
is a full-featured batch software that includes a job
queuing mechanism, configurable scheduling policy,
priority scheme, checkpoint capability, and
resource monitoring and management. Condor can be
used to connect remote clusters at geographically diverse
locations, so it is a natural fit to the GRID computational
philosophy. Condor has an interface to the
GRID via Condor-G.

FIG. 5: Historical load information on the ATLAS cluster.

FIG. 6: Historical temperature data in the vicinity of the
Linux Farm equipment.

FIG. 7: Summary view of the ganglia prototype for STAR.

FIG. 8: Detailed view of a STAR node with ganglia.

The Linux Farm group is in the midst of upgrading
its MDS-compatible batch system to improve reliability
and scalability and to add functionality. As part
of the upgrade, Condor is being evaluated as a job
scheduler for the new MDS-compatible batch system.
Since media-based MDS systems are expected to play
a considerable role at the RCF for the foreseeable future,
an effort is being made to integrate Condor with
the MDS-interface API software. The current batch
system does not have an interface to GRID-like architectures,
and Condor can add this missing functionality
via Condor-G. The basic design of the new batch
system is shown in Figure 10.

Once the prototype of the upgraded MDS-compatible
batch system is installed, Condor scalability
studies will be done to understand how performance
is affected under the expected heavy usage.



C. GLOBUS & LSF 

The ability of users to submit jobs to remote clusters
has been one of the principal motivations for the
Linux Farm group to explore interfacing our batch
system with GRID-like software.

The Linux Farm has a prototype GLOBUS [5] gatekeeper
server that interfaces with LSF for the ATLAS
experiment. Authorized users at remote sites can
submit jobs to the gatekeeper. The gatekeeper interprets
the GLOBUS commands and submits jobs to the
proper LSF queues running on the ATLAS Linux cluster
at the RCF. A diagram of the prototype is shown
in Figure 11. Figure 12 shows actual LSF jobs submitted
by remote users via the GLOBUS gatekeeper.
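The queue-selection step performed by the gatekeeper can be pictured with the short Python sketch below. It is only an illustration of the idea, not the GLOBUS job manager itself: the experiment-to-queue table, the queue name and the function name are hypothetical, and the only LSF feature relied upon is the `-q` option of `bsub`, which selects the queue.

```python
# Hypothetical mapping from experiment to LSF queue name.
QUEUE_FOR_EXPERIMENT = {"atlas": "atlas_grid"}


def build_bsub_command(experiment: str, executable: str, args=()) -> list:
    """Build the argument list for an LSF `bsub` submission to the experiment's queue."""
    try:
        queue = QUEUE_FOR_EXPERIMENT[experiment.lower()]
    except KeyError:
        # Refuse requests from experiments with no configured queue.
        raise PermissionError(f"no LSF queue configured for {experiment!r}") from None
    return ["bsub", "-q", queue, executable, *args]


cmd = build_bsub_command("ATLAS", "/bin/hostname")
print(" ".join(cmd))  # prints: bsub -q atlas_grid /bin/hostname
```

Returning an argument list instead of a single shell string lets the caller hand the command to a process launcher without shell quoting, and supporting a further experiment amounts to adding an entry to the mapping.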

Currently, the system is being expanded to include
both the PHENIX and STAR experiments at RHIC, and
additional GLOBUS gatekeepers are being brought
on-line.

FIG. 9: Ganglia prototype in the RCF Linux Farm.

FIG. 10: MDS-compatible batch system at the RCF.

FIG. 12: LSF batch jobs submitted from the GLOBUS
gatekeeper.



VI. NEAR-TERM PLANS

In the near term, the current prototypes are expected
to expand and slowly mature into production
tools, as the outstanding issues are resolved.

Ganglia has already gone through two upgrades
within the RCF, and the Linux Farm group expects
that it will become part of the standard software
packages on all its production servers in the near future.

Condor has been evaluated continuously on a small
number of servers as the future job scheduler of the
upgraded MDS-compatible batch system. Many of
the outstanding issues have been resolved or are being
studied, and we expect the batch system upgrade to
be a year-long project.

The Linux Farm is currently using LSF v. 4.2, and
we plan to upgrade it to LSF v. 5.x together with an
OS upgrade in the next few months. New LSF features
such as advance resource reservation and GRID
membership protocols match well with the GRID
computational architecture and can further integrate
the GLOBUS gatekeepers with the LSF batch system.